Mining and Migrating Interlinear Glossed Text
Abstract
1.0 Introduction
The World Wide Web is rapidly becoming the primary medium for disseminating data on the world's languages. Researchers, linguists, and language communities regularly post a variety of language data to the Web, including lexicons, teaching materials, language recordings and transcriptions, language descriptions, and grammars. Scholarly papers on language are also posted in large quantities: by innovative online-only publications (e.g., Snippets), by traditional linguistic publications offering online editions, and by linguists themselves. Of significant potential utility to linguists is the language and linguistic data contained in these documents, specifically data presented in the form of Interlinear Glossed Text (IGT). Generally, IGT consists of three lines: a line of language data, often broken down by morpheme; a line of grammatical and gloss information aligned with the text in the first line; and a line giving the translation. An example is shown in (1). Variations on this basic form abound, but the three-line format is its most frequent instantiation.
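The three-line format described above can be sketched as a small data structure. This is a minimal illustration only, not the representation used by the authors: the class name `IGT`, the helper `parse_igt`, the hyphen as morpheme separator, and the German sample sentence are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IGT:
    """One three-line IGT instance (hypothetical structure for illustration)."""
    language_line: str  # source-language text, optionally segmented by morpheme
    gloss_line: str     # glosses aligned word by word with language_line
    translation: str    # free translation

    def aligned_tokens(self) -> List[Tuple[str, str]]:
        """Pair each whitespace-separated word with its gloss."""
        words = self.language_line.split()
        glosses = self.gloss_line.split()
        if len(words) != len(glosses):
            raise ValueError("language and gloss lines differ in token count")
        return list(zip(words, glosses))

    def aligned_morphemes(self) -> List[List[Tuple[str, str]]]:
        """Pair morphemes with gloss parts, assuming '-' marks boundaries."""
        result = []
        for word, gloss in self.aligned_tokens():
            morphs = word.split("-")
            parts = gloss.split("-")
            if len(morphs) != len(parts):
                raise ValueError(f"cannot align {word!r} with {gloss!r}")
            result.append(list(zip(morphs, parts)))
        return result


def parse_igt(block: str) -> IGT:
    """Build an IGT from a block of exactly three non-empty lines."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected exactly three non-empty lines")
    return IGT(*lines)


# Illustrative sample, not taken from the paper.
example = parse_igt("Der Hund schläf-t\nthe dog sleep-3SG\n'The dog sleeps.'")
print(example.aligned_morphemes())
```

Real IGT found on the Web departs from this shape often enough (extra lines, missing glosses, misaligned tokens) that a practical extractor would need to tolerate the cases this sketch rejects with `ValueError`.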
Related resources
Extracting Interlinear Glossed Text from LaTeX Documents
We present texigt, a command-line tool for the extraction of structured linguistic data from LaTeX source documents, and a language resource that has been generated using this tool: a corpus of interlinear glossed text (IGT) extracted from open access books published by Language Science Press. Extracted examples are represented in a simple XML format that is easy to process and can be used to va...
Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties
We propose to bring together two kinds of linguistic resources—interlinear glossed text (IGT) and a language-independent precision grammar resource—to automatically create precision grammars in the context of language documentation. This paper takes the first steps in that direction by extracting major-constituent word order and case system properties from IGT for a diverse sample of languages.
Enriching Interlinear Text using Automatically Constructed Annotators
In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more tha...
Xigt: extensible interlinear glossed text for natural language processing
This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the u...
Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text
As most of the world’s languages are under-resourced, projection algorithms offer an enticing way to bootstrap the resources available for one resource-poor language from a resource-rich language by means of parallel text and word alignment. These algorithms, however, make the strong assumption that the language pairs share common structures and that the parse trees will resemble one another. Th...